An evaluation of the replicability of analyses using synthetic health data

Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets.
Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
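To make the combining step concrete, the following is a minimal sketch of how one logistic regression coefficient might be pooled across m synthetic datasets and then compared with the real-data result. It assumes the commonly cited combining rules for fully synthetic data (total variance (1 + 1/m)b - u_bar) and one common definition of confidence interval overlap; the text above does not spell out either formula, so both are assumptions. The function names, the normal-quantile interval (in place of a t reference distribution), and the fallback used when the variance estimate is non-positive are illustrative choices, not the study's implementation.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def combine_fully_synthetic(estimates, variances, level=0.95):
    """Pool one coefficient across m synthetic datasets (hypothetical helper).

    estimates: per-dataset point estimates q_i (e.g. a log odds ratio)
    variances: per-dataset squared standard errors u_i
    Returns (pooled estimate, total variance, (lower, upper) interval).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two datasets")
    q_bar = mean(estimates)          # pooled point estimate
    b = variance(estimates)          # between-dataset variance (m - 1 denominator)
    u_bar = mean(variances)          # average within-dataset variance
    t_f = (1 + 1 / m) * b - u_bar    # assumed rule for fully synthetic data
    if t_f <= 0:
        # Assumption: the estimate can be non-positive for small m; fall back
        # to u_bar, which presumes synthetic data of the same size as the original.
        t_f = u_bar
    # Assumption: normal quantile used as a simplification of a t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(t_f)
    return q_bar, t_f, (q_bar - half_width, q_bar + half_width)


def ci_overlap(real_ci, synth_ci):
    """Average share of each interval covered by the other (assumed definition).

    Equals 1.0 for identical intervals and is negative for disjoint ones.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    shared = min(hi_r, hi_s) - max(lo_r, lo_s)
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


if __name__ == "__main__":
    # Made-up numbers for illustration only; not results from the study.
    q_bar, t_f, ci = combine_fully_synthetic([0.40, 0.50, 0.60], [0.01, 0.01, 0.01])
    print(q_bar, t_f, ci)
    print(ci_overlap((0.30, 0.70), ci))
```

With a single synthetic dataset the between-dataset variance cannot be estimated at all, which is one way to see why intervals computed without combining rules can be misleading.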

Participants in the control "chemotherapy-only" arm (FOLFOX, FOLFIRI or hybrid regimen without Cetuximab) were analyzed in the published secondary analysis, which included 1,543 patients. Presentation with acute obstruction of the bowel is a known risk factor for poor prognosis in patients with colon cancer [3], [4]. The main objective of this secondary analysis was to assess the role of presentation with obstruction as an independent risk factor for predicting outcomes in patients with stage III colon cancer. The primary endpoint in the published secondary analysis was disease-free survival (DFS) and the secondary endpoint was overall survival (OS); both DFS and OS were censored at five years.
The covariates in the published secondary analysis comprised three types of variables: 1) baseline demographics, including age, sex, and baseline BMI; 2) baseline Eastern Cooperative Oncology Group (ECOG) performance score, which describes patients' level of functioning in terms of their ability to care for themselves, daily activity, and physical ability; and 3) baseline cancer characteristics, including clinical T stage, lymph node involvement, histologic status, and Kirsten rat sarcoma virus (KRAS) biomarker status.

CCHS
For the cardiovascular health dataset, the outcome considered was the binary variable for CVH status and the primary exposure of interest was gender. The model included other relevant predictors (age, education, household income, household size, and whether the participant is a new immigrant or not), which were selected based on previous studies [5].
Cardiovascular diseases (CVD) continue to represent the leading cause of mortality and morbidity amongst women and men worldwide [6]. Biological differences between the sexes, such as anatomical and physiological variations in coronary arteries and the autonomic nervous system, alter the development and progression of CVD [7]. However, environment and lifestyle [8] as well as individuals' identity, roles, and relations in society may play an important role. These characteristics are gendered in that they affect males and females differently and evolve from early life to adulthood [9]. The specific model we evaluated is a classification version of the regression model predicting CVH [10].

DCCG
This is a prospectively maintained Danish Colorectal Cancer Group (DCCG) database including all Danish patients with a first-time diagnosis of right-sided colonic cancer between 2001 and 2018 [11]. The main outcome that we model is medical complications after surgery. The covariate of interest is sex.
The literature on post-operative outcomes in colon surgery shows different results regarding the effect of gender on post-operative complications. However, a snapshot prospective audit conducted by the European Society of Colo-Proctology (ESCP) provided real-time international data [12]. The study showed higher rates of post-operative complications in men (OR 1.5, 95% CI [1.2-1.8], p<0.001) who underwent right-sided colon resection for colon cancer. This has been confirmed by another snapshot prospective audit on left colon resection [13], which reported a higher rate of post-operative complications in men (OR 1.46, 95% CI [1.16-1.84], p=0.001). It is also interesting to see that the rate of conversion from laparoscopic to open surgery is higher in men (OR 1.50, 95% CI [1.17-1.93], p=0.001). Female hormones might have a protective effect, as shown by lower rates of post-operative infection after elective colorectal surgery [14]. Despite these findings, men take a shorter time to physically recover after colorectal surgery [15]. The reasons for the disparity in outcome between men and women need to be investigated further.

N0147 Results
Figure 9: Decision agreement and estimate agreement for the N0147 colon cancer dataset using the CTGAN method.

The other covariates included in the model were: Age, ASA score, Localization of tumor, Procedure, Pathological T stage, Pathological N stage, Pathologically shown total number of removed lymph nodes, Pathologically shown total number of lymph nodes with metastasis, and Unplanned intra-operative adverse events (UIAEs).

Figure 1: Decision agreement and estimate agreement for the DCCG colon cancer dataset using the sequential synthesis method.

Figure 2: Standardized difference and confidence interval overlap for the DCCG colon cancer dataset using the sequential synthesis method.

Figure 3: The bias and power for the Danish (DCCG) colon cancer dataset using the sequential synthesis method.

Figure 4: The coverage and empirical SE for the Danish (DCCG) colon cancer dataset using the sequential synthesis method.

Figure 6: Standardized difference and confidence interval overlap for the CCHS dataset using the sequential synthesis method.

Figure 7: The bias and power for the CCHS dataset using the sequential synthesis method.

Figure 8: The coverage and empirical SE for the CCHS dataset using the sequential synthesis method.

Figure 10: Standardized difference and confidence interval overlap for the N0147 colon cancer dataset using the CTGAN method.

Figure 11: The bias and power for the N0147 colon cancer dataset using CTGAN.

Figure 12: The coverage and empirical SE for the N0147 colon cancer dataset using CTGAN.

Figure 14: Standardized difference and confidence interval overlap for the DCCG colon cancer dataset using the CTGAN method.

Figure 15: The bias and power for the Danish (DCCG) colon cancer dataset using CTGAN.

Figure 16: The coverage and empirical SE for the Danish (DCCG) colon cancer dataset using CTGAN.

Figure 18: Standardized difference and confidence interval overlap for the CCHS dataset using the CTGAN method.

Figure 19: The bias and power for the CCHS dataset using CTGAN.

Figure 20: The coverage and empirical SE for the CCHS dataset using CTGAN.

Table 1: Average membership disclosure values for the three datasets using the CTGAN generative model.